Abstract



A main goal of regression is to derive statistical conclusions on the conditional distribution of
the output variable Y given the input values x. Two of the most important characteristics of a
single distribution are location and scale. Support vector machines (SVMs) are well established for
estimating location functions such as the conditional median or the conditional mean. We investigate
the estimation of scale functions by SVMs when the conditional median is also unknown. Estimation
of scale functions is important, e.g., for estimating the volatility in finance. We consider the median
absolute deviation (MAD) and the interquantile range (IQR) as measures of scale. Our main
result shows the consistency of MAD-type SVMs.

1 Introduction 

Let P be the distribution of a pair of random variables (X, Y) with values in a set $\mathcal{X} \times \mathcal{Y}$ where X
is an input variable and Y is a real-valued output variable. The goal in regression problems is to
derive statistical conclusions on the conditional distribution of Y given X = x. Generally, location
and scale are considered as the two most important characteristics of a distribution and estimating
these quantities is one of the main topics in statistics.

Regularized empirical risk minimization [26l EZl UHl H] using the kernel trick proposed by [19] and
the special case of support vector machines (SVMs) [26l [6l [HI [23] are well established methods for
estimating the location of the conditional distribution of Y given X = x. For an i.i.d. sample
$D = ((X_1, Y_1), \ldots, (X_n, Y_n))$ drawn from P, the SVM-estimator is defined by

$$ f_{L,D,\lambda} := \arg\inf_{f \in H} \frac{1}{n} \sum_{i=1}^{n} L(Y_i, f(X_i)) + \lambda \|f\|_H^2 , \qquad (1) $$

where L is a loss function, H is a certain space of functions $f : \mathcal{X} \to \mathbb{R}$, a so-called reproducing
kernel Hilbert space (RKHS), and $\lambda \in (0, \infty)$ is a regularization parameter that prevents overfitting.
We refer to [371 [HI [21 ISl [22 for the concept of an RKHS. There are a number of different quantities
which describe the location of a single distribution and which can be estimated by SVMs by choosing
a suitable loss function. The conditional mean function $g(x) := E_P[Y | X = x]$, $x \in \mathcal{X}$, can be estimated
by using the least-squares loss $L_{LS}(y, t) = (y - t)^2$, and the $\tau$-quantile function $g(x) := f^*_{\tau,P}(x)$, $x \in \mathcal{X}$,
(see (2) below) by using the $\tau$-pinball loss function

$$ L_{\tau\text{-pin}}(y, t) = \begin{cases} (1 - \tau)(t - y) & \text{if } y - t < 0, \\ \tau (y - t) & \text{if } y - t \ge 0, \end{cases} \qquad (y, t) \in \mathcal{Y} \times \mathbb{R}, $$

see [121 HH 1201 12S]. The choice $\tau = 0.5$ leads to an estimate of the median function

$$ f^*_{0.5,P}(x) := \mathrm{median}_P(Y | X = x), \qquad x \in \mathcal{X}. $$
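To illustrate why the pinball loss targets a quantile, the following minimal sketch (not part of the original paper; the function names `pinball_loss` and `empirical_risk_minimizer` are illustrative assumptions) minimizes the empirical pinball risk over constants for a simulated sample. Under these assumptions the minimizer coincides with an order statistic of the sample, i.e. with an empirical $\tau$-quantile.

```python
import numpy as np

def pinball_loss(y, t, tau):
    """tau-pinball loss: (1 - tau) * (t - y) if y - t < 0, else tau * (y - t)."""
    r = np.asarray(y, dtype=float) - np.asarray(t, dtype=float)
    return np.where(r < 0, (tau - 1.0) * r, tau * r)

def empirical_risk_minimizer(sample, tau):
    """Minimize the empirical pinball risk over constant predictions t.

    The empirical risk is convex and piecewise linear with kinks at the
    sample points, so it is enough to search over the sample values.
    """
    sample = np.asarray(sample, dtype=float)
    risks = [pinball_loss(sample, t, tau).mean() for t in sample]
    return sample[int(np.argmin(risks))]

rng = np.random.default_rng(0)
sample = rng.normal(size=501)            # simulated data, odd sample size

t_median = empirical_risk_minimizer(sample, 0.5)   # tau = 0.5: sample median
t_upper = empirical_risk_minimizer(sample, 0.9)    # tau = 0.9: upper quantile
print(t_median, np.median(sample))
print(t_upper, np.sort(sample)[450])
```

In an SVM the constant t is replaced by a function from an RKHS and a regularization term is added, but the loss plays the same role.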

The goal of this paper is to investigate two methods to estimate the variability of the conditional
distributions of Y given X = x for $x \in \mathcal{X}$ via scale functions. Estimation of heteroscedasticity
is interesting in many areas of applied statistics, e.g., for the estimation of volatility in finance.
To fix ideas, let us illustrate what we mean by scale function estimation by considering a small
data set concerning the so-called LIDAR technique. LIDAR is the abbreviation of Light Detection And
Ranging. This technique uses the reflection of laser-emitted light to detect chemical compounds in
the atmosphere. We consider the logarithm of the ratio of light received from two laser sources as
the output variable Y = logratio, whereas the single input variable X = range is the distance
traveled before the light is reflected back to its source. We refer to [17] for more details on this data
set. A scatterplot of the data set consisting of n = 221 observations is shown in the left subplot
of Figure 1 together with the fitted quantile curves based on SVMs using the pinball loss function
for $\tau \in \{0.05, 0.25, 0.5, 0.75, 0.95\}$ and the Gaussian RBF kernel $k(x, x') := \exp(-\gamma \|x - x'\|_2^2)$ for
$x, x' \in \mathcal{X}$. By looking at the estimated median function (i.e., the black curve in the middle of the
left subplot), we clearly see a nonlinear relationship between both variables which is almost constant
for values of range below 550 and decreasing for higher values of range. However, there is also a
considerable change of the variability of logratio given range: the variability is relatively small for
small values of range, but much larger for large values of range. This becomes obvious by looking
at the other estimated quantile curves in the left subplot or by looking at the right subplot of Figure
1 which shows the estimated width of intervals covering at least 50% of the mass of $P(Y|x)$. In this
simple example we can just look at the 2-dimensional plot to recognize this kind of heteroscedasticity
of the conditional distribution of Y given X = x. However, this is obviously no longer possible if the
input space $\mathcal{X}$ is a high-dimensional Euclidean space or an abstract metric space. Hence an automatic
and non-parametric method to model and to estimate this kind of variability becomes important.
Therefore, this article investigates how two classical scale quantities of the conditional distribution
of Y given X = x can be estimated by use of SVMs. Such scale functions $g : \mathcal{X} \to [0, \infty)$ are quite
common in a heteroscedastic model like $Y | x = f(x) + g(x)\varepsilon$, where f denotes the location function
and $\varepsilon$ denotes the stochastic error term. Note that we will not assume such a specific model. As
in the case of location, there are several well established quantities which describe the scale, e.g., [101
Chap. 5]:

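(i) the variance function: $g(x) := \mathrm{Var}_P(Y | X = x)$, $x \in \mathcal{X}$,

(ii) the median absolute deviation from the median (MAD) function:

$g(x) := \mathrm{MAD}_P(Y | X = x) := \mathrm{median}_P\bigl(|Y - f^*_{0.5,P}(X)| \,\big|\, X = x\bigr)$, $x \in \mathcal{X}$,

(iii) the interquantile range (IQR) function for quantiles $\tau_1 < \tau_2$:

$g(x) := \mathrm{IQR}_{\tau_1,\tau_2}(Y | X = x) := f^*_{\tau_2,P}(x) - f^*_{\tau_1,P}(x)$, $x \in \mathcal{X}$.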
Note that these three quantities are not directly comparable. However, $\mathrm{IQR}_{0.25,0.75}$ and 2 times
MAD are both quantities for the width of an interval covering at least 50% of the probability mass
of $P(Y|x)$. There is a vast literature on the estimation of scale functions, often based on special
parametric dispersion models, see, e.g., [TTJ [3T| [TS], and for a wavelet thresholding approach for
univariate regression models we refer to '3].
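The relation between $\mathrm{IQR}_{0.25,0.75}$ and 2 times MAD can be checked on simulated data for a single distribution. The sketch below is only an illustration under assumed distributions (a normal and an exponential sample); the helper names `mad` and `iqr` are not from the paper. Both intervals cover at least half of the sample, the two widths roughly agree for the symmetric sample, and they differ for the skewed one.

```python
import numpy as np

def mad(sample):
    """Median absolute deviation from the median (no consistency factor)."""
    sample = np.asarray(sample, dtype=float)
    return np.median(np.abs(sample - np.median(sample)))

def iqr(sample, tau1=0.25, tau2=0.75):
    """Interquantile range: difference of the tau2- and tau1-quantiles."""
    q1, q2 = np.quantile(sample, [tau1, tau2])
    return q2 - q1

def mad_interval_coverage(sample):
    """Fraction of the sample inside [median - MAD, median + MAD]."""
    sample = np.asarray(sample, dtype=float)
    return np.mean(np.abs(sample - np.median(sample)) <= mad(sample))

rng = np.random.default_rng(1)
symmetric = rng.normal(loc=0.0, scale=2.0, size=10001)   # assumed example
skewed = rng.exponential(scale=1.0, size=10001)          # assumed example

for name, s in [("normal", symmetric), ("exponential", skewed)]:
    print(name, "2*MAD =", 2 * mad(s), "IQR =", iqr(s),
          "coverage =", mad_interval_coverage(s))
```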

Figure 1: Illustration of the estimation of scale functions by SVMs for the LIDAR data set. Left
subplot: data set, estimated quantile functions with SVMs for $\tau = 0.5$ (black), $\tau = 0.25$ and $\tau = 0.75$
(both in blue), $\tau = 0.05$ and $\tau = 0.95$ (both in red). Right subplot: Estimated width of the intervals
covering 50% of the mass of $P(Y|x)$. IQR-type SVM (blue) using $(\tau_1, \tau_2) = (0.25, 0.75)$ and 2 times
the MAD-type SVM (green).

In this article, we consider the MAD function and the IQR function and show how both can
be consistently estimated in a purely nonparametric manner with SVMs. In the case of the MAD, we
estimate the unknown median function $f^*_{0.5,P}$ by an SVM $f_{L_{0.5\text{-pin}},D,\lambda}$ and calculate the estimated
absolute residuals $R_i := |Y_i - f_{L_{0.5\text{-pin}},D,\lambda}(X_i)|$ in a first step. In a second step, we estimate the
conditional median of the absolute residuals by the SVM based on a smoothed version of the 0.5-pinball
loss defined in (4) below for the pairs of random variables $(X_i, R_i)$. The resulting estimator is
called MAD-type SVM and it is shown in Subsection 2.1 that it is risk-consistent (up to any predefined
$\varepsilon > 0$) even though (i) the estimation in the second step cannot be based on the true residuals but
has to be based on the estimated residuals because the true median function is unknown and (ii) the
random variables $(X_i, R_i)$ are not i.i.d. In the case of the IQR $f^*_{\tau_2,P} - f^*_{\tau_1,P}$, we estimate each
$\tau_j$-quantile function $f^*_{\tau_j,P}$ by use of the $\tau_j$-pinball loss so that we get
$f_{L_{\tau_2\text{-pin}},D,\lambda_2} - f_{L_{\tau_1\text{-pin}},D,\lambda_1}$ as
an estimate, which we call IQR-type SVM. As this is the difference of two standard SVMs, we can
carry over many well-known facts on SVMs to this case in Subsection 2.2. In both cases, available
software, e.g., the R package kernlab [T^ or the C++ implementation mySVM [16 , can be used since
we essentially have to calculate SVMs for pinball losses.
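The two-step MAD-type procedure described above can be sketched in a few lines. This is a rough illustration under several assumptions that are not taken from the paper: the data are simulated, the kernel and regularization parameters are arbitrary, all function names are hypothetical, and, to keep the solver short, the smoothed 0.5-pinball loss is used in both steps (the paper uses the exact pinball loss in the first step). Neither kernlab nor mySVM is used; the optimization is a plain damped Newton iteration on the kernel expansion coefficients.

```python
import numpy as np

def gaussian_kernel(a, b, gamma):
    """Gaussian RBF kernel matrix exp(-gamma * (a_i - b_j)^2) for 1-d inputs."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def smoothed_loss(r, eps):
    """Smoothed 0.5-pinball loss as a function of r = y - t."""
    # r/2 - eps*log(2*Lambda(r/eps)), with log(2*Lambda(z)) = log 2 - log(1 + e^{-z})
    return 0.5 * r - eps * (np.log(2.0) - np.logaddexp(0.0, -r / eps))

def fit_smoothed_svm(K, y, lam, eps, n_iter=200):
    """Damped Newton for (1/n) * sum_i L_eps(y_i, (K a)_i) + lam * a' K a."""
    n = len(y)
    a = np.zeros(n)

    def objective(coef):
        return smoothed_loss(y - K @ coef, eps).mean() + lam * coef @ K @ coef

    for _ in range(n_iter):
        z = (y - K @ a) / eps
        sig = 0.5 * (1.0 + np.tanh(0.5 * z))   # Lambda(z), overflow-safe
        d = 0.5 - sig                           # first derivative of L_eps in t
        w = sig * (1.0 - sig) / eps             # second derivative of L_eps in t
        # Newton direction p solves (W K / n + 2 lam I) p = -(d / n + 2 lam a)
        M = (w[:, None] * K) / n + 2.0 * lam * np.eye(n)
        p = np.linalg.solve(M, -(d / n + 2.0 * lam * a))
        step, current = 1.0, objective(a)
        while step > 1e-8 and objective(a + step * p) > current:
            step *= 0.5                         # backtracking line search
        if step <= 1e-8:
            break                               # no further decrease found
        a = a + step * p
    return a

def mad_type_estimate(x, y, gamma=20.0, lam1=1e-3, lam2=1e-3, eps=0.1):
    """Two-step sketch: median fit, then median fit of absolute residuals."""
    K = gaussian_kernel(x, x, gamma)
    a1 = fit_smoothed_svm(K, y, lam1, eps)           # step 1: median function
    residuals = np.abs(y - K @ a1)                   # estimated absolute residuals
    a2 = fit_smoothed_svm(K, residuals, lam2, eps)   # step 2: median of residuals
    return np.maximum(K @ a2, 0.0)                   # negative values set to zero

# Simulated heteroscedastic data: the noise scale grows with x (assumed example).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, size=200))
y = np.sin(2 * np.pi * x) + (0.1 + 0.9 * x) * rng.normal(size=200)
mad_hat = mad_type_estimate(x, y)
print(mad_hat[x < 0.3].mean(), mad_hat[x > 0.7].mean())
```

On this simulated example the estimated scale should be larger for large x than for small x, mirroring the behaviour described for the LIDAR data.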

The rest of the paper has the following structure. Section 2 contains our main result, Theorem 2.2.
Section 3 contains not only the proof of this theorem, but also gives two new consistency results
in the $L_1$-sense for SVMs based on the pinball loss, see Lemma 3.1 and Theorem 3.2. Although we
need these results in our proof of Theorem 2.2, we think that they are of interest in their own right,
because they improve earlier consistency results of SVMs which showed the weaker kind of convergence in
probability, see [23, Cor. 3.62, Thm. 9.7(i)].



2 Main results 

The following assumptions and notations are used throughout the whole article. 

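Assumption 2.1 Let $\mathcal{X}$ be a complete separable metric space, e.g. $\mathcal{X} = \mathbb{R}^d$, and let $\mathcal{Y} \subset \mathbb{R}$ be closed. For
$j \in \{1, 2\}$, let $k_j : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a bounded continuous kernel with
$\|k_j\|_\infty := \sup_{x \in \mathcal{X}} (k_j(x, x))^{1/2} < \infty$. Its corresponding reproducing kernel Hilbert space (RKHS) is denoted by $H_j$, its corresponding
canonical feature map is denoted by $\Phi_j$, and it is assumed that each $H_j$ is dense in $L_1(\mu)$ for every
$\mu \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$.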

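$\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ denotes the set of all Borel probability measures on $\mathcal{X} \times \mathcal{Y}$. The unknown joint distribution
of (X, Y) is denoted by $P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and $D = ((X_1, Y_1), \ldots, (X_n, Y_n))$ is an i.i.d. sample drawn
from P. Let $P_X$ denote the distribution of X, let $\mathcal{L}_0(\mathcal{X})$ denote the set of all Borel measurable
functions $f : \mathcal{X} \to \mathbb{R}$ and let $L_1(P_X)$ denote the set of all $P_X$-integrable functions $f : \mathcal{X} \to \mathbb{R}$. We
define the $\tau$-quantile function as the (perhaps set-valued) function

$$ F^*_{\tau,P} : \mathcal{X} \to 2^{\mathbb{R}}, \quad x \mapsto F^*_{\tau,P}(x) := \bigl\{ t \in \mathbb{R} : P((-\infty, t] \,|\, x) \ge \tau \text{ and } P([t, \infty) \,|\, x) \ge 1 - \tau \bigr\}. \qquad (2) $$

We make the standard assumption that the sets $F^*_{\tau,P}(x)$ are singletons and hence we write $f^*_{\tau,P} : \mathcal{X} \to \mathbb{R}$
instead, see m HI].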
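The set-valued definition (2) can be made concrete for a distribution with finitely many atoms. The following sketch (the function name `quantile_set` and the two toy distributions are assumptions for illustration) returns the closed interval of all $\tau$-quantiles; it is a singleton in the second example but a whole interval in the first, which is the situation excluded by the standard assumption.

```python
import numpy as np

def quantile_set(values, probs, tau):
    """Return (lo, hi), the closed interval of all tau-quantiles of a discrete law.

    t is a tau-quantile if P((-inf, t]) >= tau and P([t, inf)) >= 1 - tau.
    """
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    cdf = np.cumsum(p)                      # P(Y <= v_k)
    surv = np.cumsum(p[::-1])[::-1]         # P(Y >= v_k)
    tol = 1e-12                             # guards against rounding in the sums
    # smallest atom t with P(Y <= t) >= tau
    lo = v[np.argmax(cdf >= tau - tol)]
    # largest atom t with P(Y >= t) >= 1 - tau
    hi = v[len(v) - 1 - np.argmax((surv >= 1.0 - tau - tol)[::-1])]
    return float(lo), float(hi)

# Uniform on {1, 2, 3, 4}: every t in [2, 3] is a median.
print(quantile_set([1, 2, 3, 4], [0.25] * 4, 0.5))
# Uniform on {1, 2, 3}: the median is unique.
print(quantile_set([1, 2, 3], [1 / 3] * 3, 0.5))
```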

2.1 MAD-type estimation 

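We would like to estimate the MAD function given by $g(x) = \mathrm{MAD}_P(Y | X = x) = \mathrm{median}_P\bigl(|Y - f^*_{0.5,P}(X)| \,\big|\, X = x\bigr)$
where $f^*_{0.5,P}$ is the median function. First, we estimate the median function $f^*_{0.5,P}$.
For a random sample $D = ((X_1, Y_1), \ldots, (X_n, Y_n))$ drawn from P, the SVM-estimator for $f^*_{0.5,P}$ is

$$ f_{L_{0.5\text{-pin}},D,\lambda_{1,n}} = \arg\inf_{f \in H_1} \frac{1}{n} \sum_{i=1}^{n} L_{0.5\text{-pin}}(Y_i, f(X_i)) + \lambda_{1,n} \|f\|_{H_1}^2 , $$

where $\lambda_{1,n} \in (0, \infty)$ and $H_1$ is an RKHS. Then, we can estimate the conditional median of the absolute
residuals $|Y - f^*_{0.5,P}(X)|$ by use of the estimated absolute residuals. Let us define

$$ g_{D,n} = \arg\inf_{g \in H_2} \Bigl( \frac{1}{n} \sum_{i=1}^{n} L_\varepsilon\bigl( |Y_i - f_{L_{0.5\text{-pin}},D,\lambda_{1,n}}(X_i)| ,\, g(X_i) \bigr) + \lambda_{2,n} \|g\|_{H_2}^2 \Bigr), \qquad (3) $$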

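where, for some small predefined number $\varepsilon > 0$, the loss function defined by

$$ L_\varepsilon(y, t) = \tfrac{1}{2}(y - t) - \varepsilon \log\Bigl( 2 \Lambda\Bigl( \frac{y - t}{\varepsilon} \Bigr) \Bigr) = L_{0.5\text{-pin}}(y, t) - \varepsilon \log\Bigl( 2 \Lambda\Bigl( \frac{|y - t|}{\varepsilon} \Bigr) \Bigr), \qquad (4) $$

is an $\varepsilon$-smoothed version of the pinball loss function for $\tau = 0.5$, $\Lambda(r) = 1/(1 + e^{-r})$ for every $r \in \mathbb{R}$,
$\lambda_{2,n} \in (0, \infty)$, and $H_2$ is an RKHS. Since $g_{D,n}$ occasionally can have negative values, we propose the
MAD-type estimator

$$ \tilde{g}_{D,n} = \max\{ g_{D,n}, 0 \} \qquad (5) $$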

instead of $g_{D,n}$. We use the smoothed version of the pinball loss function $L_{0.5\text{-pin}}$ because we will
need in the proof of Theorem 2.2 that the loss function has a Lipschitz continuous derivative, see (18)
and (19). This is the price we pay for the unavoidable fact that our estimation cannot be based on
the true residuals but only on the estimated ones because the distribution P of $(X_i, Y_i)$ is assumed to be
unknown in statistical machine learning. Some easy calculations show that the smoothed pinball loss
function is convex, Lipschitz continuous with $|L_\varepsilon|_1 = 0.5$, has a Lipschitz continuous derivative,
and fulfills $0 \le L_{0.5\text{-pin}}(y, t) - L_\varepsilon(y, t) \le \log(2)\varepsilon < \varepsilon$ for every $(y, t) \in \mathcal{Y} \times \mathbb{R}$, and the risks fulfill, for
all $P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$,

$$ 0 \le E_P L_{0.5\text{-pin}}(Y, f(X)) - E_P L_\varepsilon(Y, f(X)) \le E_P \bigl| L_{0.5\text{-pin}}(Y, f(X)) - L_\varepsilon(Y, f(X)) \bigr| < \varepsilon . $$
The $\varepsilon$-smoothed version of the pinball loss is actually a re-parametrized logistic loss function,
$L_\varepsilon(y, t) = \varepsilon L_{\text{logistic}}(y/\varepsilon, t/\varepsilon)/2$, see [23, p. 44]. Hence the SVM based on $L_\varepsilon$ can be calculated by any software
which supports the logistic loss. For the illustration purposes in the introduction, we used $\varepsilon = 0.1$
and calculated (3) by Newton-Raphson.
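The stated properties of the smoothed loss can be checked numerically. The sketch below is an illustration only; the function names are hypothetical, and the form of the logistic loss for regression, $-\log(4\Lambda(r)(1 - \Lambda(r)))$ with $r = y - t$, is an assumption about the loss referred to above. It verifies on a grid that the gap to the 0.5-pinball loss lies in $[0, \varepsilon\log 2]$, that the re-parametrization identity holds, and that the derivative with respect to t is bounded by 0.5.

```python
import numpy as np

def pinball_half(y, t):
    """0.5-pinball loss, i.e. half the absolute error."""
    return 0.5 * np.abs(y - t)

def logistic(r):
    """Lambda(r) = 1 / (1 + exp(-r)), written in an overflow-safe form."""
    return 0.5 * (1.0 + np.tanh(0.5 * r))

def smoothed_pinball(y, t, eps):
    """L_eps(y, t) = (y - t)/2 - eps * log(2 * Lambda((y - t)/eps))."""
    r = y - t
    # log(2 * Lambda(z)) = log(2) - log(1 + exp(-z))
    return 0.5 * r - eps * (np.log(2.0) - np.logaddexp(0.0, -r / eps))

def smoothed_pinball_dt(y, t, eps):
    """Partial derivative of L_eps in t: 1/2 - Lambda((y - t)/eps)."""
    return 0.5 - logistic((y - t) / eps)

def logistic_regression_loss(y, t):
    """Assumed logistic loss for regression: -log(4 * Lambda(r) * (1 - Lambda(r)))."""
    r = y - t
    return -np.log(4.0) + np.logaddexp(0.0, r) + np.logaddexp(0.0, -r)

eps = 0.1
grid = np.linspace(-3.0, 3.0, 601)          # values of y, with t = 0
gap = pinball_half(grid, 0.0) - smoothed_pinball(grid, 0.0, eps)
rescaled = eps * logistic_regression_loss(grid / eps, 0.0) / 2.0
print(gap.min(), gap.max(), eps * np.log(2.0))
```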

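For any loss function L and all measurable $f, g : \mathcal{X} \to \mathbb{R}$, define the risk

$$ \mathcal{R}_{L,P}(f, g) := E_P L\bigl( |Y - f(X)| , g(X) \bigr) . \qquad (6) $$

If the median function $f^*_{0.5,P}$ and the MAD function $g_P(x) = \mathrm{MAD}(Y | X = x)$ uniquely exist, then
the MAD function minimizes $g \mapsto \mathcal{R}_{L_{0.5\text{-pin}},P}(f^*_{0.5,P}, g)$ over all measurable functions $g : \mathcal{X} \to \mathbb{R}$,
i.e.

$$ \mathcal{R}_{L_{0.5\text{-pin}},P}(f^*_{0.5,P}, g_P) = \inf_{g \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_{0.5\text{-pin}},P}(f^*_{0.5,P}, g) . $$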

The following theorem says that $\tilde{g}_{D,n}$ is risk $\varepsilon$-consistent for the MAD function.

Theorem 2.2 In addition to Assumption 2.1 assume that $E_P|Y| < \infty$ and that the median function
$f^*_{0.5,P} : \mathcal{X} \to \mathbb{R}$ is almost surely unique. Let $L_0$ denote the 0.5-pinball loss function and let $\varepsilon > 0$ be
the predefined real number in the loss function $L_\varepsilon$. Then, for $n \to \infty$,

$$ \inf_{g \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_0,P}(f^*_{0.5,P}, g) + \varepsilon \;\ge\; \mathcal{R}_{L_0,P}(f^*_{0.5,P}, \tilde{g}_{D,n}) + o_P(1) \;=\; \mathcal{R}_{L_0,P}(f_{L_0,D,\lambda_{1,n}}, \tilde{g}_{D,n}) + o_P(1) $$

if $\lim_{n\to\infty} \lambda_{1,n} = 0$, $\lim_{n\to\infty} \lambda_{2,n} = 0$, and $\lim_{n\to\infty} \lambda_{1,n}^2 \lambda_{2,n}^2 n = \infty$.

Remarks: (i) We assume that the true median function uniquely exists but we do not assume that
the true MAD function uniquely exists. (ii) The value $\mathcal{R}_{L_0,P}(f^*_{0.5,P}, \tilde{g}_{D,n})$ quantifies the expected
distance of the estimate $\tilde{g}_{D,n}$ to the absolute values of the true residuals $|Y - f^*_{0.5,P}(X)|$; the value
$\mathcal{R}_{L_0,P}(f_{L_0,D,\lambda_{1,n}}, \tilde{g}_{D,n})$ quantifies the expected distance of the estimate $\tilde{g}_{D,n}$ to the absolute values of
the estimated residuals $|Y - f_{L_0,D,\lambda_{1,n}}(X)|$. According to Theorem 2.2, both values asymptotically
achieve the infimal risk up to the predefined $\varepsilon > 0$. (iii) The assumption $\lim_{n\to\infty} \lambda_{1,n}^2 \lambda_{2,n}^2 n = \infty$ is
stronger than the standard assumption $\lim_{n\to\infty} \lambda_n^2 n = \infty$; see [23, Thm. 9.6]. This is plausible because
estimating the MAD is burdened with the estimation of a nuisance function (i.e. the unknown median
function).

2.2 IQR-type estimation 



Let us now consider a linear combination of m SVMs under Assumption 2.1. As the results follow
by straightforward calculations using standard results on SVMs, the proofs are left out.

Let m be a positive integer, $J = \{1, \ldots, m\}$, $c = (c_1, \ldots, c_m) \in \mathbb{R}^m \setminus \{0\}$, and let $(\xi_{j,n})_{n \in \mathbb{N}_0}$ be a
sequence of measurable functions into some complete separable metric space $E_j$ equipped with its
Borel $\sigma$-algebra, $j \in J$. Obviously, $\sum_{j \in J} c_j \xi_{j,n}$ exists and is unique if all $\xi_{j,n}$ exist and are unique.
Furthermore, $\sum_{j \in J} c_j \xi_{j,n}$ converges to $\sum_{j \in J} c_j \xi_{j,0}$ in probability (or almost surely or in the $L_p$ sense)
if all $\xi_{j,n}$ converge in probability (or almost surely or in the $L_p$ sense) to $\xi_{j,0}$ for $n \to \infty$.

Now, let $0 < \tau_1 < \ldots < \tau_m < 1$. If we either specialize that $\xi_{j,n}$ denotes the support vector
machine $f_{L_{\tau_j\text{-pin}},D,\lambda_{j,n}}$ and choose as $E_j$ the RKHS $H_j$, or that $\xi_{j,n}$ denotes the corresponding risk
$E_P L_{\tau_j\text{-pin}}(Y, f_{L_{\tau_j\text{-pin}},D,\lambda_{j,n}}(X))$ and choose $E_j = \mathbb{R}$, then existence, uniqueness and consistency
results for the linear combination of the SVMs or of their risks follow immediately from results valid
for each individual SVM, see, e.g., [231|22ll24j and our Theorem 3.2. Denote the subdifferential (see
e.g. [Zl Section 5.3]) of the pinball loss function $L_{\tau_j\text{-pin}}$ (with respect to the second argument) by
$\partial L_{\tau_j\text{-pin}}$. We then obtain immediately a representer theorem for the linear combination of SVMs
because it is well-known that each individual SVM has a representer theorem, i.e. it holds

$$ \sum_{j \in J} c_j f_{L_{\tau_j\text{-pin}},P,\lambda_j} = - \sum_{j \in J} \frac{c_j}{2 \lambda_j} E_P \bigl[ h_{j,P} \Phi_j \bigr] , \qquad (7) $$

where the functions $h_{j,P}$ fulfill

$$ h_{j,P}(x, y) \in \partial L_{\tau_j\text{-pin}}\bigl( y, f_{L_{\tau_j\text{-pin}},P,\lambda_j}(x) \bigr) \qquad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \qquad (8) $$

see e.g. [23, Thm. 5.8, Cor. 5.11]. In the same manner we obtain by straightforward calculations
bounds for the maximal bias of SVMs and Bouligand influence functions for linear combinations of
SVMs, see 14]. Note that SVMs exist even for heavy-tailed distributions which violate the classical
assumption $E_P|Y| < \infty$, which can be shown by using a trick already used by [3] where instead of the
original loss function a shifted loss function $L^*_{\tau\text{-pin}}(y, t) := L_{\tau\text{-pin}}(y, t) - L_{\tau\text{-pin}}(y, 0)$,
$y, t \in \mathbb{R}$, is used, see [S]. This only changes the objective function to be minimized, but not the SVM
itself.

Example 2.3 [Estimation of scale functions.] Let m = 2, c = (-1, +1), $\tau_1 \in (0, \frac{1}{2})$ and $\tau_2 \in (\frac{1}{2}, 1)$,
e.g. $(\tau_1, \tau_2) = (\frac{1}{4}, \frac{3}{4})$ or $(\tau_1, \tau_2) = (0.05, 0.95)$. Then we obtain immediately existence, uniqueness
and consistency results for the difference of the two SVMs based on pinball loss functions $L_{\tau_2\text{-pin}}$ and
$L_{\tau_1\text{-pin}}$, respectively. In other words, if we denote a $\tau_j$-quantile of the conditional distribution of Y
given X = x by $f^*_{\tau_j,P}$, then the following difference of two SVMs

$$ f_{L_{\tau_2\text{-pin}},D,\lambda_{2,n}} - f_{L_{\tau_1\text{-pin}},D,\lambda_{1,n}} $$

yields an estimator for the difference $f^*_{\tau_2,P} - f^*_{\tau_1,P}$.

Example 2.4 [Estimation of asymmetry functions.] Let m = 3, c = (+1, -2, +1), $\tau_1 \in (0, \frac{1}{2})$,
$\tau_2 = \frac{1}{2}$, and $\tau_3 \in (\frac{1}{2}, 1)$, e.g. $(\tau_1, \tau_2, \tau_3) = (0.05, 0.5, 0.95)$. Then we obtain
immediately existence, uniqueness and consistency results for

$$ f_{L_{\tau_3\text{-pin}},D,\lambda_{3,n}} - 2 f_{L_{\tau_2\text{-pin}},D,\lambda_{2,n}} + f_{L_{\tau_1\text{-pin}},D,\lambda_{1,n}} , $$

which gives us an estimator for the difference

$$ \bigl( f^*_{\tau_3,P} - f^*_{\tau_2,P} \bigr) - \bigl( f^*_{\tau_2,P} - f^*_{\tau_1,P} \bigr) . \qquad (9) $$

Let us now choose $\tau \in (0, \frac{1}{2})$ and $\tau_1 = 1 - \tau_3 = \tau$, e.g. $\tau = 0.05$. Then the function in (9) is zero if,
for all $x \in \mathcal{X}$, the upper conditional quantile $f^*_{1-\tau,P}(x)$ differs from the conditional median $f^*_{0.5,P}(x)$
by the same amount as the conditional median $f^*_{0.5,P}(x)$ differs from the lower conditional quantile
$f^*_{\tau,P}(x)$. Hence the function in (9) or its supremum norm can be used as a quantity to measure the
amount of asymmetry of the conditional distribution of Y given X = x.
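For a single (unconditional) distribution, the asymmetry quantity in (9) can be illustrated with plain empirical quantiles standing in for the three SVMs. This is a simplification made only for illustration; the function name `asymmetry` and the two simulated samples are assumptions. The value is close to zero for a symmetric sample and clearly positive for a right-skewed one.

```python
import numpy as np

def asymmetry(sample, tau=0.05):
    """(q_{1-tau} - q_{0.5}) - (q_{0.5} - q_{tau}) from empirical quantiles."""
    q_lo, q_med, q_hi = np.quantile(sample, [tau, 0.5, 1.0 - tau])
    return (q_hi - q_med) - (q_med - q_lo)

rng = np.random.default_rng(3)
sym = asymmetry(rng.normal(size=20001))          # symmetric: close to 0
skew = asymmetry(rng.exponential(size=20001))    # right-skewed: positive
print(sym, skew)
```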

It is well-known that the so-called crossing problem can occur in quantile regression and that this
problem is not specific to SVMs, see [l4t p. 55-59]. The crossing problem occurs if, for two quantile
levels $\tau_1 < \tau_2$, the estimated quantile functions $q_{\tau_1}$, $q_{\tau_2}$ are in the wrong order for at least one $x \in \mathcal{X}$,
i.e. $q_{\tau_1}(x) > q_{\tau_2}(x)$. The danger that the crossing problem occurs for a fixed data set is typically
small if $\tau_1$ is close to 0 and if $\tau_2$ is close to 1. A numerical method to prevent the crossing problem in
kernel based quantile regression was proposed by [25].
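Whether two fitted quantile curves cross can be checked on a grid of input values. The sketch below only detects crossings; it is not the prevention method of [25], and the two curves are hypothetical stand-ins for fitted quantile functions.

```python
import numpy as np

def crossing_points(grid, q_low, q_high):
    """Return the grid points where the lower quantile curve exceeds the upper one."""
    grid, q_low, q_high = map(np.asarray, (grid, q_low, q_high))
    return grid[q_low > q_high]

grid = np.linspace(0.0, 1.0, 11)
q25 = 0.2 + 0.8 * grid      # hypothetical estimate for tau = 0.25
q75 = 0.6 + 0.2 * grid      # hypothetical estimate for tau = 0.75
crossed = crossing_points(grid, q25, q75)
print(crossed)              # grid points with q25 > q75, here x > 2/3
```

At such points an IQR-type estimate would be negative, which is one practical reason to check for crossings before interpreting the difference as a scale.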

2.3 Comparison of MAD-type and IQR-type estimation 

From our point of view, it will often depend on the application whether an MAD- or an IQR-type 
SVM is more appropriate. 

We see three advantages of MAD-type estimation. (i) One can estimate the heteroscedasticity of
$P(\cdot|x)$ by estimating the conditional median of the absolute residuals $|Y - f(x)|$ without estimating
two conditional quantile functions. Because in most applications the conditional median (or the
conditional mean) is estimated anyway, one only needs to compute one additional SVM instead of
the two additional SVMs required by the IQR-type approach. (ii) It can happen that the upper and the lower
quantile functions are hard to approximate, e.g., because they are not in the RKHSs $H_1$ and $H_2$. This can
easily happen even with the classical Gaussian RBF kernel, whose RKHS contains only continuous
functions, see [131 Lem. 4.28, Cor. 4.36], whereas the true quantile functions may have jumps. (iii) It
can happen that the difference of two quantile functions is easy to estimate, e.g. it is constant, linear,
or a polynomial of low order, although the quantile functions $f^*_{\tau_1,P}$ and $f^*_{\tau_2,P}$ are complicated.

On the other hand, we see three advantages of IQR-type estimation: (i) Greater flexibility by the
choice of $(\tau_1, \tau_2)$, whereas the MAD-type approach is based on estimating one conditional quantile
(which is here $\tau = \frac{1}{2}$) of the distribution of the absolute residuals. (ii) Greater flexibility by choosing
different types of kernels or kernels with different kernel parameters for estimating the upper and the
lower quantile functions. (iii) The IQR-type approach allows the direct estimation of asymmetry or
other quantities of interest for the distribution of Y given X = x.

3 Proofs 

3.1 $L_1$ consistency of quantile function estimation by SVMs

The following lemma strengthens [23, Cor. 3.62] in the case of the pinball loss function, as convergence in
probability is replaced by the stronger $L_1$-convergence.

Lemma 3.1 Let L be the pinball loss with $\tau \in (0, 1)$ and let $P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ be the distribution of
(X, Y). Assume that $E_P|Y| < \infty$ and that the conditional quantile function $f^*_{\tau,P} : \mathcal{X} \to \mathbb{R}$ is $P_X$-a.s.
unique. Then, for every sequence $f_n \in L_1(P_X)$, $n \in \mathbb{N}$, we have

$$ \lim_{n\to\infty} \mathcal{R}_{L,P}(f_n) = \inf_{f \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L,P}(f) \quad \Longrightarrow \quad \lim_{n\to\infty} \| f_n - f^*_{\tau,P} \|_{L_1(P_X)} = 0 . $$

Proof: Define $h_n : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ by $h_n(x, y) = L(y, f_n(x))$ and $h_0 : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ by
$h_0(x, y) = L(y, f^*_{\tau,P}(x))$. Define $c := \min\{1 - \tau, \tau\}$ and note that $L(y, t) \ge c|y - t|$ for every $(y, t) \in \mathcal{Y} \times \mathbb{R}$.
According to [23, Cor. 3.62], it is already known that $f_n \to f^*_{\tau,P}$ in probability (w.r.t. $P_X$). Therefore,
it follows from the continuity of L that $h_n \to h_0$ in probability (w.r.t. P). Since

$$ \lim_{n\to\infty} \int |h_n| \, dP = \lim_{n\to\infty} \mathcal{R}_{L,P}(f_n) = \inf_{f \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L,P}(f) = \mathcal{R}_{L,P}(f^*_{\tau,P}) = \int |h_0| \, dP , $$

the sequence $(h_n)_{n \in \mathbb{N}}$ is uniformly integrable; see e.g. [U Thm. 21.7]. Since

$$ |f_n(x)| \le |y - f_n(x)| + |y| \le c^{-1} L(y, f_n(x)) + |y| = c^{-1} h_n(x, y) + |y| \qquad \forall (x, y, n) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{N}, $$

it follows that the sequence $f_n$, $n \in \mathbb{N}$, is uniformly integrable, too. Hence convergence in probability
of $f_n$, $n \in \mathbb{N}$, implies $L_1$-convergence; see e.g. [J Thm. 21.7]. ∎
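The two elementary inequalities used in this proof can be written out as follows. This is only a worked restatement in the notation of the lemma, using that the pinball loss equals $(1-\tau)|y-t|$ or $\tau|y-t|$ depending on the sign of $y - t$.

```latex
% Lower bound on the pinball loss with c = \min\{1-\tau, \tau\}:
L(y,t) =
\begin{cases}
(1-\tau)\,|y-t| & \text{if } y - t < 0,\\
\tau\,|y-t|     & \text{if } y - t \ge 0,
\end{cases}
\qquad\Longrightarrow\qquad
L(y,t) \;\ge\; c\,|y-t| .

% Domination of f_n, by the triangle inequality and the bound above:
|f_n(x)| \;\le\; |y - f_n(x)| + |y|
        \;\le\; c^{-1} L\bigl(y, f_n(x)\bigr) + |y|
        \;=\; c^{-1} h_n(x,y) + |y| .
```

Since $(h_n)$ is uniformly integrable and $E_P|Y| < \infty$, the dominating sequence $c^{-1} h_n + |y|$ is uniformly integrable, which carries over to $(f_n)$.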

The following theorem strengthens [23, Thm. 9.7(i)] as convergence in probability is replaced by the
stronger $L_1$-convergence. The proof coincides with that of [23, Thm. 9.7(i)] apart from applying
Lemma 3.1 instead of [23, Cor. 3.62] and therefore is omitted.

Theorem 3.2 Let $\mathcal{X}$ be a complete measurable space, $\mathcal{Y} \subset \mathbb{R}$ be closed, L be the pinball loss with
$\tau \in (0, 1)$, H be a separable RKHS of a bounded kernel k on $\mathcal{X}$ such that H is dense in $L_1(\mu)$ for
all $\mu \in \mathcal{M}_1(\mathcal{X})$, and $\lambda_n \in (0, \infty)$, $n \in \mathbb{N}$, such that $\lim_{n\to\infty} \lambda_n = 0$ and $\lim_{n\to\infty} \lambda_n^2 n = \infty$. Let
$P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ be the distribution of (X, Y) and assume that $E_P|Y| < \infty$ and that the conditional
quantile function $f^*_{\tau,P} : \mathcal{X} \to \mathbb{R}$ is $P_X$-a.s. unique. Then,

$$ \| f_{L,D,\lambda_n} - f^*_{\tau,P} \|_{L_1(P_X)} \to 0 \quad \text{in probability}, \quad n \to \infty . $$

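3.2 Proof of Theorem 2.2

In order to increase the readability of the proof, a comprehensive notation is needed. Therefore, we
define $L_0 := L_{0.5\text{-pin}}$ and, for probability measures $P_1, P_2 \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$, we define

$$ f_{P_1;n} := f_{L_0,P_1,\lambda_{1,n}} = \arg\inf_{f \in H_1} \Bigl( \int L_0(y, f(x)) \, P_1(d(x,y)) + \lambda_{1,n} \|f\|_{H_1}^2 \Bigr) , $$

$$ g_{P_1,P_2;n} := \arg\inf_{g \in H_2} \Bigl( \int L_\varepsilon\bigl( |y - f_{P_1;n}(x)| , g(x) \bigr) \, P_2(d(x,y)) + \lambda_{2,n} \|g\|_{H_2}^2 \Bigr) . $$

In this definition, $P_1$ and $P_2$ can also be equal to the empirical measure $D = \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$, which
corresponds to the random sample $D = ((X_1, Y_1), \ldots, (X_n, Y_n))$. That is, the estimate $\tilde{g}_{D,n}$ defined
in (3) and (5) is given by $\tilde{g}_{D,n} = \max\{ g_{D,D;n}, 0 \}$ in this notation. We obtain

$$ L'_\varepsilon(y, t) := \frac{\partial}{\partial t} L_\varepsilon(y, t) = \frac{1}{2} - \Lambda\Bigl( \frac{y - t}{\varepsilon} \Bigr) \qquad \forall y, t \in \mathbb{R} . $$

Since $|\frac{\partial}{\partial y} L_\varepsilon(y, t)| \le 0.5$ and $|\frac{\partial}{\partial t} L_\varepsilon(y, t)| \le 0.5$ for every $y, t \in \mathbb{R}$, the following Lipschitz property is
fulfilled:

$$ |L_\varepsilon(y_1, t_1) - L_\varepsilon(y_2, t_2)| \le 0.5 |y_1 - y_2| + 0.5 |t_1 - t_2| \qquad \forall y_1, y_2, t_1, t_2 \in \mathbb{R} . \qquad (10) $$

An easy calculation shows that (10) implies

$$ \bigl| \mathcal{R}_{L_\varepsilon,P}(f_1, g_1) - \mathcal{R}_{L_\varepsilon,P}(f_2, g_2) \bigr| \le 0.5 \|f_1 - f_2\|_{L_1(P_X)} + 0.5 \|g_1 - g_2\|_{L_1(P_X)} \qquad (11) $$

for all $f_1, f_2, g_1, g_2 \in L_1(P_X)$. Note that, by construction, $0 \le L_0 - L_\varepsilon \le \varepsilon$, which implies

$$ \mathcal{R}_{L_\varepsilon,P}(f, g) \le \mathcal{R}_{L_0,P}(f, g) \le \mathcal{R}_{L_\varepsilon,P}(f, g) + \varepsilon \qquad \forall f, g \in L_1(P_X) . \qquad (12) $$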
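The step from the Lipschitz property (10) to the risk bound (11) can be spelled out as follows. This is a worked version of the "easy calculation" under the definitions above; it uses the risk definition (6) and the reverse triangle inequality $\bigl| |a| - |b| \bigr| \le |a - b|$.

```latex
\begin{align*}
\bigl|\mathcal{R}_{L_\varepsilon,P}(f_1,g_1) - \mathcal{R}_{L_\varepsilon,P}(f_2,g_2)\bigr|
 &\le E_P\bigl| L_\varepsilon(|Y-f_1(X)|, g_1(X)) - L_\varepsilon(|Y-f_2(X)|, g_2(X)) \bigr| \\
 &\le 0.5\,E_P\bigl|\,|Y-f_1(X)| - |Y-f_2(X)|\,\bigr| + 0.5\,E_P|g_1(X)-g_2(X)| \\
 &\le 0.5\,\|f_1-f_2\|_{L_1(P_X)} + 0.5\,\|g_1-g_2\|_{L_1(P_X)} .
\end{align*}
```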

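It is obvious from the definition (6) of the risk $\mathcal{R}_{L_0,P}(f, g)$ that replacing negative values of the function
g by 0 reduces the risk. Hence, the definitions imply $\mathcal{R}_{L_0,P}(f^*_{0.5,P}, \tilde{g}_{D,n}) \le \mathcal{R}_{L_0,P}(f^*_{0.5,P}, g_{D,D;n})$ and
it follows from (12) that

$$ \inf_{g \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_\varepsilon,P}(f^*_{0.5,P}, g) \le \inf_{g \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_0,P}(f^*_{0.5,P}, g) \le \mathcal{R}_{L_0,P}(f^*_{0.5,P}, \tilde{g}_{D,n}) \le \mathcal{R}_{L_\varepsilon,P}(f^*_{0.5,P}, g_{D,D;n}) + \varepsilon . \qquad (13) $$

Applying the triangle inequality yields

$$ \mathcal{R}_{L_\varepsilon,P}(f^*_{0.5,P}, g_{D,D;n}) \le \bigl| \mathcal{R}_{L_\varepsilon,P}(f^*_{0.5,P}, g_{D,D;n}) - \mathcal{R}_{L_\varepsilon,P}(f_{P;n}, g_{P,P;n}) \bigr| + \mathcal{R}_{L_\varepsilon,P}(f_{P;n}, g_{P,P;n}) . \qquad (14) $$

Next, define

$$ \Delta_1^{(n)} := \| g_{D,D;n} - g_{P,D;n} \|_{L_1(P_X)} , \qquad \Delta_2^{(n)} := \| g_{P,D;n} - g_{P,P;n} \|_{L_1(P_X)} , $$

$$ \Delta_3^{(n)} := \| f^*_{0.5,P} - f_{P;n} \|_{L_1(P_X)} , \qquad \Delta_4^{(n)} := \mathcal{R}_{L_\varepsilon,P}(f_{P;n}, g_{P,P;n}) - \inf_{g \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_\varepsilon,P}(f^*_{0.5,P}, g) . $$

Then, it follows from (13), (14), (11), and another application of the triangle inequality that

$$ 0 \le \mathcal{R}_{L_0,P}(f^*_{0.5,P}, \tilde{g}_{D,n}) - \inf_{g \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_0,P}(f^*_{0.5,P}, g) \le 0.5 \bigl( \Delta_1^{(n)} + \Delta_2^{(n)} + \Delta_3^{(n)} \bigr) + \Delta_4^{(n)} + \varepsilon . \qquad (15) $$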

Each of the four summands $\Delta_1^{(n)}, \ldots, \Delta_4^{(n)}$ will be considered separately in the following four parts.
In order to prove the theorem, it is enough to show that $\Delta_1^{(n)}$ and $\Delta_2^{(n)}$ converge to 0 in probability
(Part 1 and Part 2), that $\Delta_3^{(n)}$ converges to 0 (Part 3), and that the limit superior of $\Delta_4^{(n)}$ is not
larger than 0 (Part 4); the terms $\Delta_3^{(n)}$ and $\Delta_4^{(n)}$ are non-stochastic. Note that (11) and Theorem 3.2
imply, for $n \to \infty$, the convergence in probability of

$$ \bigl| \mathcal{R}_{L_0,P}(f^*_{0.5,P}, \tilde{g}_{D,n}) - \mathcal{R}_{L_0,P}(f_{L_0,D,\lambda_{1,n}}, \tilde{g}_{D,n}) \bigr| \le 0.5 \, \| f^*_{0.5,P} - f_{L_0,D,\lambda_{1,n}} \|_{L_1(P_X)} \to 0 . $$



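Part 1: For $D = ((X_1, Y_1), \ldots, (X_n, Y_n))$, define

$$ Q_D := \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_i, \, |Y_i - f_{P;n}(X_i)|)} \qquad \text{and} \qquad \tilde{Q}_D := \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_i, \, |Y_i - f_{D;n}(X_i)|)} . $$

For every $(x, y) \in \mathcal{X} \times \mathcal{Y}$, define $h_{D,n}(x, y) = L'_\varepsilon(y, g_{P,D;n}(x))$. Then, it follows from the representer
theorem [23, Cor. 5.10] that

$$ \| g_{D,D;n} - g_{P,D;n} \|_{H_2} \le \lambda_{2,n}^{-1} \bigl\| E_{\tilde{Q}_D} h_{D,n} \Phi_2 - E_{Q_D} h_{D,n} \Phi_2 \bigr\|_{H_2} \le \frac{1}{\lambda_{2,n} n} \sum_{i=1}^{n} \bigl| h_{D,n}(X_i, |Y_i - f_{D;n}(X_i)|) - h_{D,n}(X_i, |Y_i - f_{P;n}(X_i)|) \bigr| \cdot \| \Phi_2(X_i) \|_{H_2} . \qquad (16) $$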

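According to the boundedness of $k_1$ and $k_2$, we will use the well-known inequalities

$$ \| \Phi_j(x) \|_{H_j} \le \| k_j \|_\infty \quad \forall x \in \mathcal{X} \qquad \text{and} \qquad \| f \|_\infty \le \| k_j \|_\infty \| f \|_{H_j} \quad \forall f \in H_j \qquad (17) $$

for every $j \in \{1, 2\}$; see (55] p. 124]. Then, the definition of $h_{D,n}$ and the easy to prove Lipschitz
property $|L'_\varepsilon(y_1, t) - L'_\varepsilon(y_2, t)| \le \varepsilon^{-1} |y_1 - y_2|$ for all $y_1, y_2, t \in \mathbb{R}$ imply

$$ \| g_{D,D;n} - g_{P,D;n} \|_{H_2} \le \frac{\| k_2 \|_\infty}{\lambda_{2,n} n} \sum_{i=1}^{n} \bigl| h_{D,n}(X_i, |Y_i - f_{D;n}(X_i)|) - h_{D,n}(X_i, |Y_i - f_{P;n}(X_i)|) \bigr| $$

$$ \le \| k_2 \|_\infty \lambda_{2,n}^{-1} \sup_{t} \sup_{x,y} \bigl| L'_\varepsilon(|y - f_{D;n}(x)|, t) - L'_\varepsilon(|y - f_{P;n}(x)|, t) \bigr| \qquad (18) $$

$$ \le \| k_2 \|_\infty \varepsilon^{-1} \lambda_{2,n}^{-1} \sup_{x,y} \bigl| \, |y - f_{D;n}(x)| - |y - f_{P;n}(x)| \, \bigr| \le \| k_1 \|_\infty \| k_2 \|_\infty \varepsilon^{-1} \lambda_{2,n}^{-1} \| f_{D;n} - f_{P;n} \|_{H_1} . \qquad (19) $$

Next, it follows from the representer theorem [23, Cor. 5.10] that there is an $h_{P,n} \in \mathcal{L}_\infty(\mathcal{X} \times \mathcal{Y})$ such that
$\| h_{P,n} \|_\infty \le 0.5$ and

$$ \| f_{D;n} - f_{P;n} \|_{H_1} \le \lambda_{1,n}^{-1} \Bigl\| \frac{1}{n} \sum_{i=1}^{n} \bigl( h_{P,n}(X_i, Y_i) \Phi_1(X_i) - E_P h_{P,n} \Phi_1 \bigr) \Bigr\|_{H_1} . \qquad (20) $$

Define $B := \sup_{x,y} \| h_{P,n}(x, y) \Phi_1(x) \|_{H_1} \le 0.5 \| k_1 \|_\infty$ and fix any $\eta > 0$. Then it follows from (20) and
Hoeffding's inequality Chapter 3] that, for $n \to \infty$,

$$ P^n\bigl( \lambda_{2,n}^{-1} \| f_{D;n} - f_{P;n} \|_{H_1} \ge \eta \bigr) \le \exp\Bigl( - \frac{3}{8} \cdot \frac{\eta^2 \lambda_{1,n}^2 \lambda_{2,n}^2 n}{\eta \lambda_{1,n} \lambda_{2,n} B + 3 B^2} \Bigr) \to 0 , $$

because $\lim_{n\to\infty} \lambda_{1,n}^2 \lambda_{2,n}^2 n = \infty$. That is, we have shown that
$\Delta_1^{(n)} \le \| k_2 \|_\infty \| g_{D,D;n} - g_{P,D;n} \|_{H_2}$ converges to 0 in probability w.r.t. $P^n$.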

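Part 2: Define $L_{\varepsilon,n}(x, y, t) := L_\varepsilon(|y - f_{P;n}(x)|, t)$. This yields

$$ g_{P,P_1;n} := \arg\inf_{g \in H_2} \Bigl( \int L_{\varepsilon,n}(x, y, g(x)) \, P_1(d(x,y)) + \lambda_{2,n} \|g\|_{H_2}^2 \Bigr) \qquad \forall P_1 \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) . $$

Hence, for $h_{P,n}(x, y) = L'_\varepsilon(|y - f_{P;n}(x)|, g_{P,P;n}(x))$, the representer theorem [23, Cor. 5.10] implies that

$$ \| g_{P,D;n} - g_{P,P;n} \|_{H_2} \le \lambda_{2,n}^{-1} \Bigl\| \frac{1}{n} \sum_{i=1}^{n} h_{P,n}(X_i, Y_i) \Phi_2(X_i) - E_P h_{P,n} \Phi_2 \Bigr\|_{H_2} . $$

For $B := \sup_{x,y} \| h_{P,n}(x, y) \Phi_2(x) \|_{H_2} \le 0.5 \| k_2 \|_\infty$, it follows from Hoeffding's inequality [551 Chap. 3]
and $\lambda_{2,n}^2 n \to \infty$ that $\Delta_2^{(n)} \le \| k_2 \|_\infty \| g_{P,D;n} - g_{P,P;n} \|_{H_2}$ converges to 0 in probability.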

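Part 3: Since $\lim_{n\to\infty} \mathcal{R}_{L_0,P}(f_{P;n}) = \inf_{f \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_0,P}(f)$ as shown in [SSj p. 338], it follows from
Lemma 3.1 that

$$ \lim_{n\to\infty} \Delta_3^{(n)} = \lim_{n\to\infty} \| f^*_{0.5,P} - f_{P;n} \|_{L_1(P_X)} = 0 . \qquad (21) $$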
Part 4: For every $g \in H_2$, define the approximation error function (where we use the notation (6))

$$ A_g : L_1(P_X) \times \mathbb{R} \to \mathbb{R}, \qquad (f, \lambda) \mapsto \mathcal{R}_{L_\varepsilon,P}(f, g) + \lambda \|g\|_{H_2}^2 - \inf_{g_0 \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_\varepsilon,P}(f^*_{0.5,P}, g_0) . $$

Note that the assumption $E_P|Y| < \infty$ implies that $|A_g(f, \lambda)| < \infty$ such that $A_g$ is well defined. It
follows from the Lipschitz property (10) of $L_\varepsilon$ that $A_g$ is continuous for every $g \in H_2$ and, therefore,
the map $(f, \lambda) \mapsto \inf_{g \in H_2} A_g(f, \lambda)$ is upper semicontinuous. Hence, (21) implies

$$ \limsup_{n\to\infty} \Delta_4^{(n)} \le \limsup_{n\to\infty} \inf_{g \in H_2} A_g(f_{P;n}, \lambda_{2,n}) \le \inf_{g \in H_2} A_g(f^*_{0.5,P}, 0) = 0 , $$

where the last equality follows because the assumption that $H_2$ is dense in $L_1(P_X)$ guarantees
$\inf_{g_0 \in \mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_\varepsilon,P}(f^*_{0.5,P}, g_0) = \inf_{g \in H_2} \mathcal{R}_{L_\varepsilon,P}(f^*_{0.5,P}, g)$ according to [23, Theorem 5.31]. ∎